NeuroVault (https://neurovault.org/) is an open repository for sharing unthresholded statistical maps, parcellations, and atlases of the human brain. Researchers upload the brain images from their studies, which enables interactive visualization while also contributing data for meta-analysis. This project is a meta-analysis of the brain images available on NeuroVault.
However, because investigators produce images for different purposes and acquire them from different sources (e.g., different MRI scanners), the images are not calibrated for collective analysis. This is why a preprocessing pipeline is necessary. In this project, we standardize the images to a common space (MNI 2mm), resolution, and intensity scale (RESI) to allow further meta-analysis of these valuable image data. Because NeuroVault is continuously updated, we aim to rerun this preprocessing pipeline monthly to capture newly added images and gradually accumulate a comprehensive preprocessed dataset for future use.
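As a minimal sketch of the target common space, the following hypothetical helper (the function name and signature are illustrative, not part of the pipeline) checks whether an image already sits on the standardized grid: MNI 2mm, i.e. 91 × 109 × 91 voxels at 2 mm isotropic.

```r
# Hypothetical helper: does an image already sit on the pipeline's target grid?
# Target space assumed here: MNI 2mm, 91 x 109 x 91 voxels, 2 mm isotropic.
in_target_space <- function(imgdim, voxdim, tol = 1e-6) {
  identical(as.integer(imgdim), c(91L, 109L, 91L)) &&
    all(abs(voxdim - 2) < tol)
}

in_target_space(c(91, 109, 91), c(2, 2, 2))    # TRUE: already standardized
in_target_space(c(182, 218, 182), c(1, 1, 1))  # FALSE: MNI 1mm grid
```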
library(parallel)
library(RNifti)
library(pbj)
library(abind)
library(dplyr)
library(table1)
library(GGally)
library(ggplot2)
library(httr)
library(jsonlite)
library(ComplexUpset)
library(tidyr)
knitr::knit_hooks$set(GPs=function(before, options, envir){
if (before){
cex=1.5
par(mgp=c(1.8,.7,0), lwd=2, lend=2,
cex.lab=cex, cex.axis=cex, cex.main=1*cex,
mar=c(2.8,3,1.8,.5), bty='l', oma=c(0,0,2,0))}
})
knitr::opts_chunk$set(echo = TRUE, fig.height = 7.5, fig.width = 7.5, GPs=TRUE, cache=FALSE)
# visualize mean image
template = "/usr/local/fsl/data/standard/MNI152_T1_2mm_brain.nii.gz"
template_mask = "/usr/local/fsl/data/standard/MNI152_T1_2mm_brain_mask.nii.gz"
# this is the abide study mask
abidemaskfile = file.path('/media/disk2/home/vandeks/vandekar_R01_renewal_2024/analyses', 'abide/neuroimaging/cpac/mask_2mm.nii.gz')
# this is the neurovault mask
maskfile='/media/disk2/home/vandeks/vandekar_R01_renewal_2024/analyses/mask.nii.gz'
# for the analysis, a combined mask is used.
# Number of cores
n_cores <- 24

The NeuroVault metadata include manually entered information at the study and image level describing the type of data that were uploaded. The NeuroVault data contain the following variables that we use in the data preprocessing:
- `analysis_level`: indicates whether the image is a group-level or individual-level statistical map.
- `not_mni`: indicates whether the image is in MNI space.
- `is_thresholded`: indicates whether the image is thresholded or unthresholded.
- `map_type`: indicates the statistical type of the image (e.g., T, Z, F).
- `perc_bad_voxels`: proportion of voxels with 0 intensity or NaN values.
- `perc_voxels_outside`: proportion of voxels outside the brain.
- `number_of_subjects`: sample size of the study.

We download the NeuroVault study-level and image-level metadata, then apply exclusionary quality criteria to the metadata to ensure uniformity of the group-level statistical images. Specifically, we:
- subset to group-level images (`analysis_level == "group"`).
- subset to images in MNI space (`not_mni == FALSE`).
- subset to unthresholded images (`is_thresholded == FALSE`), which ensures all statistical values computed in the image are included. This avoids bias caused by only reporting significant results.

Sys.setenv(CURL_CA_BUNDLE = "/etc/ssl/certs/ca-certificates.crt")
httr::GET("https://api.github.com")
url <- 'https://neurovault.org/api/collections/?format=json'
nv = list()
i = 0
while (!is.null(url)) {
i = i + 1
cat(i, '\t')
response <- GET(url, config(cainfo="/etc/ssl/certs/ca-certificates.crt"))
data <- content(response, "text")
json_data <- fromJSON(data, flatten = TRUE)
nv[[i]] = json_data$results
url <- json_data[['next']]
}
nv = do.call(rbind, nv)
# number of repos with DOIs
#sum(!is.na(nv$DOI))
# subset to studies that have more than 0 images
nv = nv[nv$number_of_images>0,]
saveRDS(nv, file='neurovaultMetadata.rds')

# same thing for the image data
url <- paste0("https://neurovault.org/api/images/?format=json")
results = list()
i = 0
while (!is.null(url)) {
i = i + 1
cat(i, '\t')
response <- GET(url)
data <- content(response, "text")
json_data <- fromJSON(data, flatten = TRUE)
results[[i]] = json_data$results
url <- json_data[['next']]
}
saveRDS(results, file='imageMetadata.rds')
images = lapply(nv$id, function(collection_id){
message(collection_id)
url <- paste0("https://neurovault.org/api/collections/", collection_id, "/images")
response <- GET(url)
data <- content(response, "text")
json_data <- fromJSON(data, flatten = TRUE)
json_data$results
})
# Keep only data frames with ≥53 columns
images <- images[sapply(images, function(x) is.data.frame(x) && ncol(x) >= 53)]
# Split into chunks
chunked_images <- split(images, sort(seq_along(images) %% n_cores))
# Tracker function
progress_fun <- function(i, total) {
cat(sprintf("Chunk %d of %d merged [%s]\n", i, total, Sys.time()))
}
# Merge within each chunk (in parallel)
partial_merged <- mclapply(seq_along(chunked_images), function(i) {
chunk <- chunked_images[[i]]
merged_chunk <- Reduce(function(x, y) merge(x, y, all = TRUE), chunk)
progress_fun(i, length(chunked_images))
return(merged_chunk)
}, mc.cores = n_cores)
# Merge final chunks (sequentially — usually < 8, so fast)
cat("Final merge of all chunks...\n")
images_merged <- Reduce(function(x, y) merge(x, y, all = TRUE), partial_merged)
# Save
saveRDS(images_merged, "fullimageMetadata.rds")
# This methodology gives a dataset with more than 1000 columns

library(forcats)
library(grid)
library(ComplexUpset)
images <- readRDS("fullimageMetadata.rds")
images <- images %>% mutate(
image_type = factor(image_type, levels = names(sort(table(image_type), decreasing = TRUE))),
map_type = factor(map_type, levels = names(sort(table(map_type), decreasing = TRUE))),
analysis_level = factor(analysis_level, levels = names(sort(table(analysis_level), decreasing = TRUE))),
is_thresholded = factor(is_thresholded, levels = names(sort(table(is_thresholded), decreasing = TRUE))),
not_mni = factor(not_mni, levels = names(sort(table(not_mni), decreasing = TRUE)))
)
table1(~ image_type + map_type + analysis_level + is_thresholded + not_mni,
data = images,
caption = "Table 1: Raw collection of all data")| Overall (N=543699) |
|
|---|---|
| image_type | |
| statistic_map | 543000 (99.9%) |
| NIDM results statistic map | 681 (0.1%) |
| atlas | 18 (0.0%) |
| map_type | |
| univariate-beta map | 152257 (28.0%) |
| variance | 138378 (25.5%) |
| Z map | 99864 (18.4%) |
| other | 67989 (12.5%) |
| T map | 67563 (12.4%) |
| ROI/mask | 9509 (1.7%) |
| P map (given null hypothesis) | 4502 (0.8%) |
| parcellation | 1068 (0.2%) |
| F map | 1031 (0.2%) |
| 1-P map ("inverted" probability) | 533 (0.1%) |
| anatomical | 477 (0.1%) |
| multivariate-beta map | 467 (0.1%) |
| Chi squared map | 43 (0.0%) |
| Missing | 18 (0.0%) |
| analysis_level | |
| single-subject | 408986 (75.2%) |
| group | 45479 (8.4%) |
| meta-analysis | 10184 (1.9%) |
| other | 2158 (0.4%) |
| | 27 (0.0%) |
| Missing | 76865 (14.1%) |
| is_thresholded | |
| FALSE | 482019 (88.7%) |
| TRUE | 54943 (10.1%) |
| Missing | 6737 (1.2%) |
| not_mni | |
| FALSE | 541415 (99.6%) |
| TRUE | 2266 (0.4%) |
| Missing | 18 (0.0%) |
## UpSet table
images$binarized_group_level <- ifelse(images$analysis_level == "group", TRUE, FALSE)
images$binarized_map_type <- ifelse(images$map_type == "T map" | images$map_type == "Z map", TRUE, FALSE)
images$unthresholded <- ifelse(images$is_thresholded == FALSE, TRUE, FALSE)
images$mni <- ifelse(images$not_mni == FALSE, TRUE, FALSE)
images_clean <- images %>%
transmute(
group_level = as.logical(binarized_group_level),
T_or_Z_maps = as.logical(binarized_map_type),
Unthresholded = as.logical(unthresholded),
Mni_space = as.logical(mni)
)
upset(
images_clean,
intersect = c("group_level", "T_or_Z_maps", "Unthresholded", "Mni_space"),
min_size = 50,
base_annotations = list('Intersection size' = intersection_size())
)

Figure 1: Overlap of NeuroVault Image Characteristics
Note: the `analysis_level` variable has other values as well, and group-level images make up less than 10% of the data.

images <- readRDS("fullimageMetadata.rds")
# The following starts the trimming process
# sort columns by nonmissingness
images = images[ ,order(colMeans(is.na(images)))]
# keep the first 47 columns, which have the least missingness
images = images[,1:47]
# subset to group-level analysis
images = images[ which(images$analysis_level=='group'), ]
# subset to unthresholded images
images = images[which(images$is_thresholded==FALSE),]
# Including only T and Z map types
images = images[images$map_type %in% c("T map", "Z map"), ]
# Exclude all not mni
images = images[which(images$not_mni == FALSE), ]
# Exclude images with missing sample size
images <- subset(images, !is.na(number_of_subjects))
table1(~ image_type + map_type + analysis_level + is_thresholded + not_mni + number_of_subjects, data = images, caption = "Table 2: Preliminary sorting")

| | Overall (N=17796) |
|---|---|
| image_type | |
| statistic_map | 17796 (100%) |
| NIDM results statistic map | 0 (0%) |
| atlas | 0 (0%) |
| map_type | |
| univariate-beta map | 0 (0%) |
| variance | 0 (0%) |
| Z map | 2419 (13.6%) |
| other | 0 (0%) |
| T map | 15377 (86.4%) |
| ROI/mask | 0 (0%) |
| P map (given null hypothesis) | 0 (0%) |
| parcellation | 0 (0%) |
| F map | 0 (0%) |
| 1-P map ("inverted" probability) | 0 (0%) |
| anatomical | 0 (0%) |
| multivariate-beta map | 0 (0%) |
| Chi squared map | 0 (0%) |
| analysis_level | |
| single-subject | 0 (0%) |
| group | 17796 (100%) |
| meta-analysis | 0 (0%) |
| other | 0 (0%) |
| | 0 (0%) |
| is_thresholded | |
| FALSE | 17796 (100%) |
| TRUE | 0 (0%) |
| not_mni | |
| FALSE | 17796 (100%) |
| TRUE | 0 (0%) |
| number_of_subjects | |
| Mean (SD) | 1880 (242000) |
| Median [Min, Max] | 27.0 [1.00, 32200000] |
Here, we download NIfTI images from NeuroVault collections after applying the metadata exclusions. For each collection, we fetch the associated images via API calls. After preparing local file paths, we check for missing files and download them in parallel using `n_cores` (24) cores.
images <- readRDS("imageMetadata.rds")
# download the images
outdir = '/media/big/neurovault'
images$localfile = gsub('http://neurovault.org/media/images', outdir, images$file)
# creates the directories
invisible(capture.output(lapply(unique(dirname(images$localfile)), dir.create, showWarnings=FALSE, recursive = TRUE)))
# Actually download the images, here, if they aren't already
# downloaded. I am not sure if there are usage limits on
# neurovault. This parallel call would probably be a problem
# if there is. There were issues the first time.
images$download = !file.exists(images$localfile)
# tryCatch to ignore errors
downloadFile = function(url, destfile){ tryCatch(download.file(url=url, destfile=destfile), error=function(e){NA})}
mcmapply(downloadFile, url=images$file[ images$download==TRUE], destfile=images$localfile[ images$download==TRUE], mc.cores = n_cores)
# recheck what files exist
images$download = !file.exists(images$localfile)
saveRDS(images, 'imageMetadata.rds')

images = readRDS('imageMetadata.rds')
images$target_template_image <- factor(images$target_template_image,
levels = names(sort(table(images$target_template_image), decreasing = TRUE)))
images$image_type <- factor(images$image_type,
levels = names(sort(table(images$image_type), decreasing = TRUE)))
images$map_type <- factor(images$map_type,
levels = names(sort(table(images$map_type), decreasing = TRUE)))
images$analysis_level <- factor(images$analysis_level,
levels = names(sort(table(images$analysis_level), decreasing = TRUE)))
images$is_thresholded <- factor(images$is_thresholded,
levels = names(sort(table(images$is_thresholded), decreasing = TRUE)))
images$not_mni <- factor(images$not_mni,
levels = names(sort(table(images$not_mni), decreasing = TRUE)))
table1(~ target_template_image + image_type + map_type + analysis_level + is_thresholded + not_mni + perc_bad_voxels + brain_coverage + perc_voxels_outside, data = images, caption = "Table 3: image data overview")

| | Overall (N=17796) |
|---|---|
| target_template_image | |
| GenericMNI | 16125 (90.6%) |
| MNI152NLin2009cAsym | 1631 (9.2%) |
| NMT | 39 (0.2%) |
| Dorr2008 | 1 (0.0%) |
| image_type | |
| statistic_map | 17796 (100%) |
| NIDM results statistic map | 0 (0%) |
| atlas | 0 (0%) |
| map_type | |
| T map | 15377 (86.4%) |
| Z map | 2419 (13.6%) |
| univariate-beta map | 0 (0%) |
| variance | 0 (0%) |
| other | 0 (0%) |
| ROI/mask | 0 (0%) |
| P map (given null hypothesis) | 0 (0%) |
| parcellation | 0 (0%) |
| F map | 0 (0%) |
| 1-P map ("inverted" probability) | 0 (0%) |
| anatomical | 0 (0%) |
| multivariate-beta map | 0 (0%) |
| Chi squared map | 0 (0%) |
| analysis_level | |
| group | 17796 (100%) |
| single-subject | 0 (0%) |
| meta-analysis | 0 (0%) |
| other | 0 (0%) |
| | 0 (0%) |
| is_thresholded | |
| FALSE | 17796 (100%) |
| TRUE | 0 (0%) |
| not_mni | |
| FALSE | 17796 (100%) |
| TRUE | 0 (0%) |
| perc_bad_voxels | |
| Mean (SD) | 72.0 (13.3) |
| Median [Min, Max] | 74.7 [0, 86.1] |
| brain_coverage | |
| Mean (SD) | 93.3 (11.7) |
| Median [Min, Max] | 98.6 [2.66, 100] |
| perc_voxels_outside | |
| Mean (SD) | 18.7 (23.1) |
| Median [Min, Max] | 15.5 [0, 100] |
After downloading the images, we compute image dimensions, voxel dimensions, number of nonzero voxels, image intensity range, and orientation from the NIfTI images. The image and voxel dimensions are used to compute the physical dimensions of each image in millimeters. We screen for duplicate images based on both qualitative and quantitative attributes, including `perc_bad_voxels`, `imgdim`, `voxdim`, `nvox`, `range`, and `bounding_box`. It would be rare for two different images to have identical values across all of these metrics, especially `bounding_box`.
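The duplicate screen can be illustrated on toy data (all values below are made up; the real pipeline builds the key from the columns listed above):

```r
# Toy data frame standing in for the image metadata (hypothetical values)
imgs <- data.frame(
  perc_bad_voxels = c(74.7, 74.7, 12.0),
  nvox            = c(168000, 168000, 90210),
  bounding_box    = c("{[0 0 0], [182 218 182]}",
                      "{[0 0 0], [182 218 182]}",
                      "{[0 0 0], [158 190 158]}"),
  stringsAsFactors = FALSE
)
# Composite key: images agreeing on every metric are treated as duplicates
key <- do.call(paste, c(imgs, sep = "|"))
# Flag every member of a duplicated group (first and later copies)
dup_flag <- duplicated(key) | duplicated(key, fromLast = TRUE)
# Deduplicate, keeping the first copy of each group
imgs_dedup <- imgs[!duplicated(key), ]
```

Here rows 1 and 2 match on every field, so both are flagged and only one survives deduplication.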
library(RNifti)
library(parallel)
library(papayaWidget)
# check voxel dimensions and generate the voxel and image dimension fields
images = readRDS('imageMetadata.rds')
images = images[ !images$download,]
cmd = paste('/usr/local/c3d-1.1.0-Linux-x86_64/bin/c3d', images$localfile, '-info')
voxdim = mclapply(cmd, system, intern=TRUE, mc.cores=n_cores)
images$voxdim = unlist(lapply(voxdim, function(x){splt = strsplit(x[1], split = ';')[[1]]; splt[grep('vox', splt)] }))
images$imgdim = unlist(lapply(voxdim, function(x){splt = strsplit(x[1], split = ';')[[1]]; splt[grep('dim', splt)] }))
images$voxdim <- trimws(images$voxdim)
images$imgdim <- trimws(images$imgdim)
# Count non-zero voxels
cmd_vox <- paste('/usr/local/c3d-1.1.0-Linux-x86_64/bin/c3d', images$localfile, '-threshold 0 0 0 1 -voxel-sum')
nvox <- unlist(mclapply(cmd_vox, system, intern = TRUE, mc.cores = n_cores))
images$nvox <- as.numeric(gsub("[^0-9.]", "", nvox)) # allow decimals just in case
# Get orientation, range, and bounding box from `-info` output
cmd_info <- paste('/usr/local/c3d-1.1.0-Linux-x86_64/bin/c3d', images$localfile, '-info')
info <- mclapply(cmd_info, system, intern = TRUE, mc.cores = n_cores)
# Helper function to extract value by keyword from a string
extract_info_field <- function(lines, keyword) {
val <- sapply(lines, function(x) {
first_line <- x[1]
splt <- strsplit(first_line, split = ";")[[1]]
match <- grep(keyword, splt, value = TRUE)
if (length(match) > 0) {
return(trimws(sub(paste0("^.*", keyword, "\\s*=\\s*"), "", match[1])))
} else {
return(NA)
}
})
return(val)
}
images$orient <- extract_info_field(info, "orient")
images$range <- extract_info_field(info, "range")
images$bounding_box <- extract_info_field(info, "bb")
images$voxdim <- trimws(gsub("vox = ", "", images$voxdim))
images$imgdim <- trimws(gsub("Image #1: dim = ", "", images$imgdim))
# Save
saveRDS(images, "imageMetadataSubset.rds")

images <- readRDS("imageMetadataSubset.rds")
library(data.table)
if (is.list(images) && !is.data.frame(images)) {
keep <- vapply(images, function(x) is.data.frame(x) && nrow(x) > 0, logical(1))
DF <- rbindlist(images[keep], use.names = TRUE, fill = TRUE)
} else {
DF <- as.data.table(images)
}
setDT(DF)
DF[, row_id := .I] # keep original row index
# Add basename(localfile)
DF[, localfile_base := basename(as.character(localfile))]
# Choose candidate key columns (use only those that exist)
cand_keys <- c("perc_bad_voxels","brain_coverage","range","bounding_box",
"nvox","imgdim","localfile_base")
cand_keys <- intersect(cand_keys, names(DF))
# Build a key-only copy and normalize any list columns
keyDF <- copy(DF[, ..cand_keys])
# Turn list-like columns into comparable strings
list_cols <- names(Filter(is.list, keyDF))
keyDF[, (list_cols) := lapply(.SD, function(x)
vapply(x, function(z) paste(as.character(z), collapse = "x"), character(1))
), .SDcols = list_cols]
# Create a stable group key for analysis (using normalized keyDF)
grp_key <- do.call(paste, c(keyDF, sep = "|"))
DF[, grp_key := grp_key]
# Flag all rows that belong to a duplicated group
dup_flag <- duplicated(keyDF) | duplicated(keyDF, fromLast = TRUE)
# The full duplicates dataset for analysis
dups <- DF[dup_flag]
dups[, dup_count := .N, by = grp_key]
setorder(dups, -dup_count)
# Quick summary of duplicate groups for spot checking
dup_groups <- unique(dups[, .(grp_key, dup_count)])[order(-dup_count)]
# DEDUPLICATION: Use consistent approach with normalized keyDF
keep_rows <- !duplicated(keyDF)
DF_dedup <- DF[keep_rows]
# Verify results
cat("Original rows:", nrow(DF), "\n")
cat("After deduplication:", nrow(DF_dedup), "\n")
cat("Duplicates removed:", nrow(DF) - nrow(DF_dedup), "\n")
# Remove temporary analysis columns if desired
DF_dedup[, c("grp_key") := NULL]
saveRDS(DF_dedup, 'imageMetadataSubset2.2.rds')

images = readRDS('imageMetadataSubset2.2.rds')
library(table1)
## factorize so that levels appear in descending frequency order
for (v in c("voxdim", "map_type", "imgdim", "bounding_box")) {
images[[v]] <- factor(images[[v]], levels = names(sort(table(images[[v]]), decreasing = TRUE)))
}
table1(~ imgdim + voxdim + nvox + map_type + perc_bad_voxels + perc_voxels_outside + bounding_box + number_of_subjects, data = images, caption = "Table 4: Raw data and dimension inspection")

images <- readRDS('MeaningfulimageMetadataSubset3.1.rds')
# Split dim_mm into x, y, z components
dims_split <- strsplit(images$dim_mm, "\\*")
dims_mat <- do.call(rbind, lapply(dims_split, as.numeric))
colnames(dims_mat) <- c("dim_x", "dim_y", "dim_z")
# Compute ratios
ratio_xy <- dims_mat[,1] / dims_mat[,2]
ratio_xz <- dims_mat[,1] / dims_mat[,3]
ratio_yz <- dims_mat[,2] / dims_mat[,3]
# Define reference ratios (from MNI: 182x218x182)
ref_xy <- 182 / 218 # ≈ 0.8349
ref_xz <- 1 # since x = z
ref_yz <- 218 / 182 # ≈ 1.1978
# Set epsilon
epsilon <- 0.025
# Logical mask for proportional dimensions
is_proportional <- abs(ratio_xy - ref_xy) < epsilon &
abs(ratio_xz - ref_xz) < epsilon &
abs(ratio_yz - ref_yz) < epsilon
# Subset original data
images_subset <- images[is_proportional, ]
#sort(table(images_subset$dim_mm),decreasing = TRUE)
images_subset$dim_mm <- as.factor(images_subset$dim_mm)
dim_mm_freq <- table(images_subset$dim_mm)
dim_mm_sorted <- names(sort(dim_mm_freq, decreasing = TRUE))
images_subset$dim_mm <- factor(images_subset$dim_mm, levels = dim_mm_sorted)
saveRDS(images_subset, "proportionalimagesubset_Section3.1.2.rds")
prop_near_mni <- nrow(images_subset)
mni_total <- nrow(images)
table1(~ dim_mm + nvox + map_type + perc_bad_voxels + perc_voxels_outside, data = images_subset, caption = "Table 5: Millimeter data distribution of subsetted dataset")

| | Overall (N=14087) |
|---|---|
| dim_mm | |
| 182*218*182 | 3200 (22.7%) |
| 195*231*196 | 2459 (17.5%) |
| 194*230*194 | 2085 (14.8%) |
| 195*231*195 | 1290 (9.2%) |
| 192*228*192 | 1092 (7.8%) |
| 158*190*158 | 806 (5.7%) |
| 195*232.5*195 | 698 (5.0%) |
| 181.5*217.5*181.5 | 487 (3.5%) |
| 196.749*230.755*194.4 | 413 (2.9%) |
| 183*219*183 | 321 (2.3%) |
| 194*230*194.4 | 288 (2.0%) |
| 159*189*156 | 267 (1.9%) |
| 184.25*220*184 | 72 (0.5%) |
| 157.5*192.5*157.5 | 68 (0.5%) |
| 180*216*180 | 63 (0.4%) |
| 158*190*156 | 59 (0.4%) |
| 194.4*230.4*194.4 | 59 (0.4%) |
| 194*230*195 | 45 (0.3%) |
| 196.218*231.894*196.42 | 33 (0.2%) |
| 181*217*181 | 24 (0.2%) |
| 180*217.5*180 | 20 (0.1%) |
| 193.2*229.2*193.2 | 20 (0.1%) |
| 188*220*184 | 19 (0.1%) |
| 182*217*182 | 18 (0.1%) |
| 157.5*190.5*157.5 | 15 (0.1%) |
| 196*232*197.6 | 15 (0.1%) |
| 194.247*230.346*195 | 14 (0.1%) |
| 195.8*231*195.8 | 14 (0.1%) |
| 182.5*220*182.5 | 13 (0.1%) |
| 194*230*195.8 | 12 (0.1%) |
| 192.5*229.25*192.5 | 11 (0.1%) |
| 196.552*231.384*195.25 | 10 (0.1%) |
| 152*184*152 | 9 (0.1%) |
| 157.5*190*157.5 | 8 (0.1%) |
| 156.6*189*158.4 | 6 (0.0%) |
| 196.472*232.412*194.4 | 6 (0.0%) |
| 198.75*232.5*197.6 | 5 (0.0%) |
| 198*231*195 | 5 (0.0%) |
| 182*218*183 | 4 (0.0%) |
| 193.5*229.5*195 | 4 (0.0%) |
| 195*231*196.56 | 4 (0.0%) |
| 157*189*156 | 2 (0.0%) |
| 164*192*160 | 2 (0.0%) |
| 169.5*205.5*169.5 | 2 (0.0%) |
| 182.5*217.5*182.5 | 2 (0.0%) |
| 192.5*230*192.5 | 2 (0.0%) |
| 195.468*230.373*195.5 | 2 (0.0%) |
| 195.966*233.784*197.2 | 2 (0.0%) |
| 195*232.5*198 | 2 (0.0%) |
| 159*192*159 | 1 (0.0%) |
| 169*205*169 | 1 (0.0%) |
| 180*219*184 | 1 (0.0%) |
| 183*219*183.6 | 1 (0.0%) |
| 184.8*220.8*182.4 | 1 (0.0%) |
| 188*224*184 | 1 (0.0%) |
| 190.9*227.7*190.9 | 1 (0.0%) |
| 192.5*230.3125*192.5 | 1 (0.0%) |
| 193.6*228.8*193.6 | 1 (0.0%) |
| 193*229*193 | 1 (0.0%) |
| nvox | |
| Mean (SD) | 168000 (113000) |
| Median [Min, Max] | 169000 [1.00, 903000] |
| map_type | |
| univariate-beta map | 0 (0%) |
| variance | 0 (0%) |
| Z map | 2274 (16.1%) |
| other | 0 (0%) |
| T map | 11813 (83.9%) |
| ROI/mask | 0 (0%) |
| P map (given null hypothesis) | 0 (0%) |
| parcellation | 0 (0%) |
| F map | 0 (0%) |
| 1-P map ("inverted" probability) | 0 (0%) |
| anatomical | 0 (0%) |
| multivariate-beta map | 0 (0%) |
| Chi squared map | 0 (0%) |
| perc_bad_voxels | |
| Mean (SD) | 73.6 (10.8) |
| Median [Min, Max] | 75.2 [0, 87.2] |
| perc_voxels_outside | |
| Mean (SD) | 18.4 (23.5) |
| Median [Min, Max] | 15.5 [0, 100] |
Using a proportionality tolerance of \(\varepsilon\) = 0.025, we find that 88% of images are approximately proportional to the MNI template.
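As a worked check of the proportionality rule (the first two dimensions below are common entries in the table above; the third is a made-up counterexample), the chunk's ratio comparisons can be wrapped in a small function:

```r
# Proportionality check against the MNI template (182 x 218 x 182 mm),
# using the same ratio comparisons and epsilon = 0.025 as the chunk above.
# The function name is illustrative, not part of the pipeline.
is_mni_proportional <- function(dim_mm, epsilon = 0.025) {
  d <- as.numeric(strsplit(dim_mm, "\\*")[[1]])
  ref <- c(182, 218, 182)
  abs(d[1]/d[2] - ref[1]/ref[2]) < epsilon &&
    abs(d[1]/d[3] - ref[1]/ref[3]) < epsilon &&
    abs(d[2]/d[3] - ref[2]/ref[3]) < epsilon
}

is_mni_proportional("182*218*182")  # TRUE: the template itself
is_mni_proportional("195*231*196")  # TRUE: within epsilon of the MNI ratios
is_mni_proportional("100*218*182")  # FALSE: x/y ratio far from 182/218
```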
@haozheng, because this depends on the initial image derived values, it should be moved up and executed with the exclusionary criteria at the top of this section. Include the plots.
Images with a low intensity range (`abs(max - min)`) are unlikely to be valid statistical images, so we exclude all images whose range falls below the threshold marked in Figure 2 (log10(range) < 0.1).
library(stringr)
library(ggplot2)
library(scales)
images <- readRDS("proportionalimagesubset_Section3.1.2.rds")
# parse [lo, hi]
m <- str_match(
images[["range"]],
"\\[\\s*([+-]?(?:\\d*\\.?\\d+(?:[eE][+-]?\\d+)?))\\s*,\\s*([+-]?(?:\\d*\\.?\\d+(?:[eE][+-]?\\d+)?))\\s*\\]"
)
lo <- as.numeric(m[, 2])
hi <- as.numeric(m[, 3])
# attach & compute range
images$range_low <- lo
images$range_high <- hi
images$range_diff <- images$range_high - images$range_low
# histogram of log10(range_diff) (positive values only)
xpos <- images$range_diff
xpos <- xpos[is.finite(xpos) & xpos > 0]
# xpos already defined above as positive, finite range_diff values
df_sc <- data.frame(
idx = seq_along(xpos),
l10 = log10(xpos)
)
tick_min <- floor(min(df_sc$l10, na.rm = TRUE))
tick_max <- ceiling(max(df_sc$l10, na.rm = TRUE))
ticks <- tick_min:tick_max
ggplot(df_sc, aes(idx, l10)) +
geom_point(alpha = 0.5, size = 0.6) +
scale_y_continuous(breaks = ticks, labels = scales::math_format(10^.x)) +
geom_hline(yintercept = ticks, linetype = "dotted", linewidth = 0.3, color = "grey70") +
labs(
title = "Figure 2: Scatterplot of range_diff (log10 scale) by index",
x = "Image index",
y = "range_diff (log10 scale)",
subtitle = sprintf("n = %d ", length(xpos))
) +
geom_hline(yintercept = 0.1,linetype = "dashed", color = "red") +
theme_minimal(base_size = 12)
# drop rows with log10(range_diff) < 0.1 (guard against non-positive / NA)
thr <- 0.1
idx_drop <- which(is.finite(images$range_diff) &
images$range_diff > 0 &
log10(images$range_diff) < thr)
images <- images[-idx_drop, , drop = FALSE]
cat("Removed:", length(idx_drop), "rows\n")

## Removed: 993 rows
We resliced the images from each dimensionality to MNI 2mm space and visually evaluated overlap with the MNI template (Figure XX). Several images were misaligned; an example is shown below.
Figure 3. Example of a misaligned image after rigid registration to MNI 2mm space.
To resolve the misalignment seen after reslicing, we evaluated registrations of varying complexity: reslicing to 2 mm, rigid registration (6 degrees of freedom), and linear (affine) registration (12 degrees of freedom).
For both rigid and linear registrations, we use mutual information (MI) as the cost function (`-cost mutualinfo` in FLIRT) to accommodate differences in contrast and intensity between the template and the statistical images. Results are shown for the seven most common dimensions, which cover the majority of the dataset, to provide a snapshot of how the images look under each registration method.
library(neurobase)
library(grid)
library(gridGraphics)
library(cowplot)
library(ggplot2)
stopifnot(exists("template"), file.exists(template))
mni <- readNIfTI(template)
images_subset <- readRDS("proportionalimagesubset_Section3.1.3.rds")
images_subset$dim_mm <- as.character(images_subset$dim_mm)
# @haozheng, this is hard coded, so should maybe be derived from the data directly
dims_to_show <- c("182*218*182","195*231*196",
"194*230*194","195*231*195","192*228*192","158*190*158","195*232.5*195")
redmap <- colorRampPalette(c("blue","red","yellow"))(256)
first_existing <- function(v) {
v <- v[!is.na(v)]
i <- which(file.exists(v))
if (length(i)) v[i[1]] else NA_character_
}
ortho2_grob <- function(bg, ovl, title, zlims) {
grid.echo(function() {
op <- par(mai = c(0.1, 0.1, 0.55, 0.2), cex.main = 0.75)
on.exit(par(op), add = TRUE)
ortho2(bg, ovl,
col.x = gray(0:64/64),
col.y = redmap,
text = title,
text.cex = 0.8,
NA.x = TRUE, NA.y = TRUE,
ycolorbar = TRUE,
zlim.y = zlims)
})
grid.grab()
}
has_flirt <- nzchar(Sys.which("flirt"))
grobs <- list()
labels <- character(0)
for (d in dims_to_show) {
img_file <- first_existing(images_subset$localfile[images_subset$dim_mm == d])
if (is.na(img_file)) {
message("No files for dim_mm = ", d)
next
}
# initialize panels
gR <- gL <- gS <- ggplot() + theme_void() +
annotate("text", x = 0, y = 0, label = "No panel", size = 5)
if (!is.na(img_file) && has_flirt) {
out1 <- tempfile(fileext = "_rigid.nii.gz")
out2 <- tempfile(fileext = "_affine.nii.gz")
out3 <- tempfile(fileext = "_reslice.nii.gz")
# Rigid
system2("flirt", c("-in", img_file, "-ref", template,
"-out", out1, "-dof", "6",
"-cost", "mutualinfo", "-interp", "trilinear"))
if (file.exists(out1)) {
stat6 <- readNIfTI(out1)
z6 <- quantile(stat6, probs = c(0.7, 1.0), na.rm = TRUE)
stat6[stat6 < z6[1]] <- NA
gR <- ggdraw(ortho2_grob(mni, stat6,
paste0("Rigid (DOF=6) — \n", gsub("\\*", "×", d)), z6))
}
# Linear
system2("flirt", c("-in", img_file, "-ref", template,
"-out", out2, "-dof", "12",
"-cost", "mutualinfo", "-interp", "trilinear"))
if (file.exists(out2)) {
stat12 <- readNIfTI(out2)
z12 <- quantile(stat12, probs = c(0.7, 1.0), na.rm = TRUE)
stat12[stat12 < z12[1]] <- NA
gL <- ggdraw(ortho2_grob(mni, stat12,
paste0("Linear (DOF=12) — \n", gsub("\\*", "×", d)), z12))
}
# Reslice (2 mm isotropic)
system2("flirt", c("-in", img_file, "-ref", template,
"-out", out3, "-applyisoxfm", "2",
"-interp", "trilinear"))
if (file.exists(out3)) {
stat_reslice <- readNIfTI(out3)
z_reslice <- quantile(stat_reslice, probs = c(0.7, 1.0), na.rm = TRUE)
stat_reslice[stat_reslice < z_reslice[1]] <- NA
gS <- ggdraw(ortho2_grob(mni, stat_reslice,
paste0("Reslicing (2mm) — \n", gsub("\\*", "×", d)), z_reslice))
}
} else if (!has_flirt) {
gS <- gR <- gL <- ggplot() + theme_void() +
annotate("text", x = 0, y = 0, label = "FLIRT not found on PATH", size = 5)
}
# Save panels
row_plot <- plot_grid(gS, gR, gL, ncol = 3, greedy = FALSE)
grobs[[length(grobs) + 1]] <- row_plot
labels <- c(labels, gsub("\\*", "×", d))
outfile <- sprintf("registrations_%s.png", gsub("\\*", "x", d))
ggsave(outfile, row_plot, width = 15, height = 3, dpi = 1000, limitsize = FALSE)
message("Saved: ", outfile)
}
Based on these results, we chose rigid registration as a balance between flexibility and accuracy across the categories. From the registered images, we then extracted the following information:
* voxdim_new: the new voxel dimensions; all images should now be 2mm isotropic.
* imgdim_new: the new image dimensions; all images should now be 91×109×91.
* range_new: the new range of voxel intensities.
library(RNifti)
library(oro.nifti)
library(pbapply)
library(parallel)
n_cores <- min(24, max(1, parallel::detectCores() - 2))
outdir = '/media/big/neurovault'
images <- readRDS("proportionalimagesubset_Section3.1.3.rds")
## FSL Setup - Robust
if (!nzchar(Sys.which("flirt"))) {
if (!nzchar(Sys.getenv("FSLDIR"))) {
fsl_candidates <- c("/usr/local/fsl", "/usr/share/fsl", "/opt/fsl")
existing <- dir.exists(fsl_candidates)
if (any(existing)) {
Sys.setenv(FSLDIR = fsl_candidates[which(existing)[1]])
} else {
stop("FSL not found. Please install FSL and set FSLDIR environment variable")
}
}
Sys.setenv(FSLOUTPUTTYPE = "NIFTI_GZ")
fslbin <- file.path(Sys.getenv("FSLDIR"), "bin")
Sys.setenv(PATH = paste(fslbin, Sys.getenv("PATH"), sep = ":"))
}
## Reference image
ref <- file.path(Sys.getenv("FSLDIR"), "data", "standard", "MNI152_T1_2mm_brain.nii.gz")
## Configuration
top_dir <- outdir
## Helpers
outname <- function(f, tag = "_rigid_cost_step1") {
out_dir <- dirname(f)
base_name <- sub("\\.nii(\\.gz)?$", "", basename(f), ignore.case = TRUE)
file.path(out_dir, paste0(base_name, tag, ".nii.gz"))
}
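As a quick check of the helper above, the following standalone sketch reproduces `outname` and shows the output path it produces (the example filename is hypothetical):

```r
# Standalone copy of the outname helper for illustration:
# strips .nii/.nii.gz, appends a tag, and keeps the directory.
outname_demo <- function(f, tag = "_rigid_cost_step1") {
  base_name <- sub("\\.nii(\\.gz)?$", "", basename(f), ignore.case = TRUE)
  file.path(dirname(f), paste0(base_name, tag, ".nii.gz"))
}

outname_demo("/data/img_0001.nii.gz")
# "/data/img_0001_rigid_cost_step1.nii.gz"
```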
validate_flirt <- function() {
flirt <- Sys.which("flirt")
if (!nzchar(flirt)) stop("FLIRT not found in PATH")
if (system2(flirt, args = "-version", stdout = FALSE, stderr = FALSE) != 0) {
stop("FLIRT exists but failed to run. Please check FSL installation")
}
flirt
}
flirt <- validate_flirt() ## Moved up here
reslice_image <- function(input, output, ref) {
if (!file.exists(output)) {
message(sprintf("Rigidly registering: %s", basename(input)))
cmd <- sprintf("%s -in '%s' -ref '%s' -out '%s' -dof 6 -cost mutualinfo -interp trilinear",
flirt, input, ref, output)
status <- system(cmd, ignore.stdout = TRUE, ignore.stderr = TRUE)
if (status != 0 || !file.exists(output) || file.size(output) == 0) {
warning("FLIRT failed for: ", input)
return(NA_character_)
}
}
return(output)
}
## Generate resliced output paths
images$resliced <- vapply(images$localfile, outname, character(1))
## Reslice
results <- mcmapply(
FUN = reslice_image,
input = images$localfile,
output = images$resliced,
MoreArgs = list(ref = ref),
mc.cores = n_cores
)
## Store results (initialize the column first, then fill successful runs)
images$resliced_path <- NA_character_
images$resliced_path[!is.na(results)] <- unlist(results[!is.na(results)])
## Save final
saveRDS(images, "proportionalimagesubset_Section3.2.rds")
library(stringr)
resliced_images2 <- readRDS("proportionalimagesubset_Section3.2.rds")
cmd_vox <- paste('/usr/local/c3d-1.1.0-Linux-x86_64/bin/c3d', resliced_images2$resliced_path, '-threshold 0 0 0 1 -voxel-sum')
nvox <- unlist(mclapply(cmd_vox, system, intern = TRUE, mc.cores = n_cores))
resliced_images2$nvox_new_section3.2 <- as.numeric(gsub("[^0-9.]", "", nvox))
# Extract new info with c3d -info
cmd_info <- paste("/usr/local/c3d-1.1.0-Linux-x86_64/bin/c3d", resliced_images2$resliced_path, "-info")
info_resliced <- mclapply(cmd_info, system, intern = TRUE, mc.cores = n_cores)
extractField <- function(x, field_regex) {
line <- x[1]
match <- regmatches(line, regexpr(field_regex, line))
return(match)
}
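For illustration, here is the same extraction applied to a hand-written line mimicking the `c3d -info` format (the sample string is an assumption for demonstration, not real c3d output):

```r
# Same pattern as extractField above: grab the first match of a field regex
extractField_demo <- function(x, field_regex) {
  regmatches(x[1], regexpr(field_regex, x[1]))
}

# hypothetical c3d -info line
info <- "Image #1: dim = [91, 109, 91]; vox = [2, 2, 2]; range = [0, 7.3]; orient = RAI"
extractField_demo(info, "vox = \\[.*?\\]")   # "vox = [2, 2, 2]"
extractField_demo(info, "orient = [A-Z]+")   # "orient = RAI"
```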
resliced_images2$voxdim_new <- sapply(info_resliced, extractField, "vox = \\[.*?\\]")
resliced_images2$imgdim_new <- sapply(info_resliced, extractField, "dim = \\[.*?\\]")
resliced_images2$orient_new <- sapply(info_resliced, extractField, "orient = [A-Z]+")
resliced_images2$range_new <- sapply(info_resliced, extractField, "range = \\[.*?\\]")
saveRDS(resliced_images2, "proportionalimagesubset_Section3.3.rds")
From the above section, we found two types of abnormal images to take care of: * 1. Images with NaN values. * 2. Images with nonzero values everywhere (both inside and outside the brain), which appear as very high counts of non-zero voxels.
Among the six most frequent dimensions viewed in 3.1, note that for 194×230×194 and 192×228×192, there are non-zero voxels everywhere, including outside the brain. Additionally, a considerable number of images contain NaN values, either inside or outside the brain mask. This happens because the original image files contain NaN values, which FSL FLIRT cannot handle properly. Therefore, in this section, we set all NaN values in the registered images to zero.
To handle the images with active voxels everywhere, we set a fixed \(\varepsilon\) (0.0003) and zero out voxels whose values lie within \(\varepsilon\) of zero in all images. The rationale behind this choice is that most images with active voxels everywhere have very low values outside the brain, usually around 0.00003, so thresholding at 0.0003 should set every voxel outside the brain to zero. Moreover, even if a few voxels inside the brain are accidentally zeroed by this rule, their values are so small that setting them to zero is harmless.
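In pure R, the effect of c3d's `-threshold -eps eps 0 1 -multiply` on a vector of voxel values can be sketched as follows (the values are made up for illustration):

```r
eps <- 3e-4
v <- c(-0.5, -2e-5, 0, 3e-5, 0.02, 1.7)  # hypothetical voxel values
# voxels with |value| <= eps are mapped to 0; all others are kept as-is
v_eps <- v * as.numeric(abs(v) > eps)
v_eps  # small values around zero are removed; -0.5, 0.02, and 1.7 survive
```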
resliced_images2 <- readRDS("proportionalimagesubset_Section3.3.rds")
# Convert NaN into zeros (run the c3d commands in place)
cmd_nan <- paste('/usr/local/c3d-1.1.0-Linux-x86_64/bin/c3d', resliced_images2$resliced_path, '-replace NaN 0.0 -o', resliced_images2$resliced_path)
invisible(mclapply(cmd_nan, system, ignore.stdout = TRUE, ignore.stderr = TRUE, mc.cores = n_cores))
eps <- 3e-4
files_all <- resliced_images2$resliced_path
keep <- !is.na(files_all) & file.exists(files_all)
files <- files_all[keep]
# new output names next to originals:
outs <- sub("\\.nii(\\.gz)?$", "_eps.nii.gz", files, ignore.case = TRUE)
cmd <- sprintf("%s %s -dup -threshold -%g %g 0 1 -multiply -o %s",
shQuote("/usr/local/c3d-1.1.0-Linux-x86_64/bin/c3d"),
shQuote(files), eps, eps, shQuote(outs))
invisible(parallel::mclapply(cmd, function(c) system(paste(c, "2>&1"), intern = TRUE),
mc.cores = n_cores))
resliced_images2$eps_path <- NA_character_
resliced_images2$eps_path[keep] <- outs
cmd_vox <- paste('/usr/local/c3d-1.1.0-Linux-x86_64/bin/c3d', resliced_images2$eps_path[keep], '-threshold 0 0 0 1 -voxel-sum')
nvox <- unlist(mclapply(cmd_vox, system, intern = TRUE, mc.cores = n_cores))
# non_zero_voxels_new: the number of non-zero voxels after the above two manipulations
resliced_images2$non_zero_voxels_new <- NA_real_
resliced_images2$non_zero_voxels_new[keep] <- as.numeric(gsub("[^0-9.]", "", nvox))
saveRDS(resliced_images2, "proportionalimagesubset_Section3.3.rds")
Up to this point, we have changed the number of non-zero voxels further by thresholding at 0.0003 and removing NaN voxels, so a new measure of the number of non-zero voxels is necessary:
* non_zero_voxels_new: indicates the number of non-zero voxels after the above manipulations.
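The voxel counts come back from c3d as text. Assuming output lines of the form `Voxel Sum: N` (our reading of c3d's `-voxel-sum` output), the numeric extraction used in this pipeline works like this:

```r
# hypothetical c3d -voxel-sum output lines
nvox_raw <- c("Voxel Sum: 182634", "Voxel Sum: 90511")
# strip everything that is not a digit or decimal point, then coerce
as.numeric(gsub("[^0-9.]", "", nvox_raw))
# 182634  90511
```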
library(parallel)
n_cores <- min(24, max(1, parallel::detectCores() - 2))
# Mask files
gm_mask <- "/media/disk2/ree/grey_matter_binary_mask.nii.gz"
wm_mask <- "/media/disk2/ree/white_matter_binary_mask.nii.gz"
mask_2mm_file <- "/media/disk2/ree/brain_binary_mask.nii.gz"
mask_2mm_file_complement <- "/media/disk2/ree/brain_binary_mask_complement.nii.gz"
# Paths
c3d_bin <- "/usr/local/c3d-1.1.0-Linux-x86_64/bin/c3d"
# Load metadata
resliced_images2 <- readRDS("proportionalimagesubset_Section3.3.rds")
files_all <- resliced_images2$eps_path
keep <- !is.na(files_all) & file.exists(files_all)
files <- files_all[keep]
idx <- which(keep)
# Build vectorized commands: (image nonzero->1) * (mask) -> voxel-sum
mk_cmd <- function(mask_path) {
sprintf('%s %s -threshold 0 0 0 1 %s -multiply -voxel-sum',
c3d_bin, files, mask_path)
}
cmd_outside <- mk_cmd(mask_2mm_file_complement)
cmd_gm <- mk_cmd(gm_mask)
cmd_wm <- mk_cmd(wm_mask)
nnz_outside <- unlist(mclapply(cmd_outside, system, intern = TRUE, mc.cores = n_cores))
nnz_gm <- unlist(mclapply(cmd_gm, system, intern = TRUE, mc.cores = n_cores))
nnz_wm <- unlist(mclapply(cmd_wm, system, intern = TRUE, mc.cores = n_cores))
# c3d prints lines like "Voxel Sum: N"; pull out the trailing number
parse_voxsum <- function(x) as.numeric(sub(".*?([-+0-9.eE]+)\\s*$", "\\1", x))
nvox_outside <- parse_voxsum(nnz_outside)
nvox_gm <- parse_voxsum(nnz_gm)
nvox_wm <- parse_voxsum(nnz_wm)
# Write back (order-preserving; NAs for missing files)
resliced_images2$nvox_outside <- NA_real_
resliced_images2$nvox_gm <- NA_real_
resliced_images2$nvox_wm <- NA_real_
resliced_images2$nvox_outside[idx] <- nvox_outside
resliced_images2$nvox_gm[idx] <- nvox_gm
resliced_images2$nvox_wm[idx] <- nvox_wm
# Save updated metadata
saveRDS(resliced_images2, "proportionalimagesubset_Section3.4.1.rds")
# thresholds
GM_MIN <- 100000
WM_MIN <- 20000
OUT_MAX <- 100000
gm_mask <- "/media/disk2/ree/grey_matter_binary_mask.nii.gz"
GM_total <- sum(readNifti(gm_mask))
wm_mask <- "/media/disk2/ree/white_matter_binary_mask.nii.gz"
WM_total <- sum(readNifti(wm_mask))
resliced_images2 <- readRDS("proportionalimagesubset_Section3.4.1.rds")
out_total <- 91*109*91 - GM_total - WM_total
resliced_images2$gm_perc <- resliced_images2$nvox_gm / GM_total
resliced_images2$wm_perc <- resliced_images2$nvox_wm / WM_total
resliced_images2$outside_perc <- resliced_images2$nvox_outside / out_total
library(GGally)
library(ggplot2)
library(scales)
library(rlang)
# --- proportions already computed above ---
coverage_df <- resliced_images2[, c("gm_perc", "wm_perc", "outside_perc")]
coverage_df <- na.omit(coverage_df)
colnames(coverage_df) <- c("Grey Matter", "White Matter", "Outside Mask")
# convert to percent for plotting
coverage_pct <- coverage_df * 100
# convert voxel-count thresholds to percent thresholds
ref_pct <- c(
"Grey Matter" = 100 * GM_MIN / GM_total,
"White Matter" = 100 * WM_MIN / WM_total,
"Outside Mask" = 100 * OUT_MAX / out_total
)
# formatter for axes
pct_lab <- scales::label_percent(scale = 1, accuracy = 1) # since data already in 0-100
diag_fun <- function(data, mapping, ...) {
p <- GGally::ggally_barDiag(data, mapping, ...)
var <- as_label(mapping$x)
thr <- ref_pct[[var]]
if (!is.null(thr)) p <- p + geom_vline(xintercept = thr, linetype = "dashed")
p + scale_x_continuous(labels = pct_lab)
}
lower_fun <- function(data, mapping, ...) {
p <- GGally::ggally_smooth(data, mapping, ...)
xvar <- as_label(mapping$x); yvar <- as_label(mapping$y)
if (!is.null(ref_pct[[xvar]])) p <- p + geom_vline(xintercept = ref_pct[[xvar]], linetype = "dashed")
if (!is.null(ref_pct[[yvar]])) p <- p + geom_hline(yintercept = ref_pct[[yvar]], linetype = "dashed")
p +
scale_x_continuous(labels = pct_lab) +
scale_y_continuous(labels = pct_lab)
}
upper_fun <- function(data, mapping, ...) {
GGally::ggally_cor(data, mapping, ...) +
theme_void()
}
GGally::ggpairs(
coverage_pct,
title = "Figure 4: Voxel Coverage Comparison (Percent)",
upper = list(continuous = upper_fun),
diag = list(continuous = diag_fun),
lower = list(continuous = wrap(lower_fun, alpha = 0.35, size = 0.6))
)
We perform further exclusions based on the coverage of the rigidly registered images, keeping only images with at least GM_MIN grey-matter voxels, at least WM_MIN white-matter voxels, and at most OUT_MAX voxels outside the brain mask:
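One detail in the filter below: plain comparisons propagate NA, so NA or infinite counts would silently corrupt the keep vector. Wrapping each condition in `is.finite()` makes the rule strictly logical. A toy illustration (values are made up):

```r
nvox <- c(150000, NA, 50000, Inf)  # hypothetical grey-matter voxel counts
GM_MIN <- 100000
nvox >= GM_MIN                     # TRUE NA FALSE TRUE  (NA and Inf slip through)
is.finite(nvox) & nvox >= GM_MIN   # TRUE FALSE FALSE FALSE
```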
resliced_images2 <- readRDS("proportionalimagesubset_Section3.4.1.rds")
# coerce just in case
resliced_images2$nvox_gm <- as.numeric(resliced_images2$nvox_gm)
resliced_images2$nvox_wm <- as.numeric(resliced_images2$nvox_wm)
resliced_images2$nvox_outside <- as.numeric(resliced_images2$nvox_outside)
# keep ONLY rows that pass all three checks (and are finite)
keep <- with(resliced_images2,
is.finite(nvox_gm) & nvox_gm >= GM_MIN &
is.finite(nvox_wm) & nvox_wm >= WM_MIN &
is.finite(nvox_outside) & nvox_outside <= OUT_MAX)
resliced_images2 <- resliced_images2[keep, , drop = FALSE]
cat("Kept:", sum(keep), " Dropped:", sum(!keep), "\n")
# save the filtered dataset
saveRDS(resliced_images2, "proportionalimagesubset_Section3.5.rds")
The following shows what the dataset looks like after the exclusions.
resliced_images2 = readRDS("proportionalimagesubset_Section3.5.rds")
# Select relevant columns for plotting
coverage_df <- resliced_images2[, c("nvox_gm", "nvox_wm", "nvox_outside")]
# Ensure no NAs and all numeric
coverage_df <- na.omit(coverage_df)
colnames(coverage_df) <- c("Grey Matter", "White Matter", "Outside Mask")
library(GGally)
library(rlang) # for as_label()
# thresholds for each variable
ref <- c("Grey Matter" = GM_MIN,
"White Matter" = WM_MIN,
"Outside Mask" = OUT_MAX)
# diagonal: histogram + vertical dashed line at the var's threshold
diag_fun <- function(data, mapping, ...) {
p <- GGally::ggally_barDiag(data, mapping, ...)
var <- as_label(mapping$x)
thr <- ref[[var]]
if (!is.null(thr)) p <- p + geom_vline(xintercept = thr, linetype = "dashed")
p
}
# lower: smooth scatter + vline/hline at the x/y thresholds
lower_fun <- function(data, mapping, ...) {
p <- GGally::ggally_smooth(data, mapping, ...)
xvar <- as_label(mapping$x); yvar <- as_label(mapping$y)
if (!is.null(ref[[xvar]])) p <- p + geom_vline(xintercept = ref[[xvar]], linetype = "dashed")
if (!is.null(ref[[yvar]])) p <- p + geom_hline(yintercept = ref[[yvar]], linetype = "dashed")
p
}
ggpairs(
coverage_df,
title = "Figure 5: Voxel Coverage Comparison",
upper = list(continuous = "cor"),
diag = list(continuous = diag_fun),
lower = list(continuous = wrap(lower_fun, alpha = 0.35, size = 0.6))
)
Here, we convert nonzero values in the T and Z statistical images into the signed version of the robust effect size index (RESI; using RESI::t2S for T maps and RESI::z2S for Z maps), based on the sample size reported in each image's metadata. We save the resulting effect size images, and compute summary statistics within gray and white matter (mean, min, max, and the 0.25, 0.5, and 0.75 quantiles) to evaluate image quality. Given that registration in the prior step could potentially break some images, we also perform a simple exclusion based on missingness of data entries.
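The pattern in the chunk below — transform only the nonzero, non-NA voxels, then zero out any non-finite results — can be sketched on a toy array. The division by 10 here is a placeholder standing in for RESI::t2S / RESI::z2S, not the actual conversion:

```r
img <- array(c(0, 1.2, NA, -2.5, 0, 3.1), dim = c(2, 3))  # toy "image"
nonzero_mask <- (img != 0) & !is.na(img)   # NA voxels drop out of the mask
effect_img <- img
# placeholder transform standing in for RESI::t2S / RESI::z2S
effect_img[nonzero_mask] <- img[nonzero_mask] / 10
effect_img[!is.finite(effect_img)] <- 0    # clean residual NA/Inf
effect_img  # nonzero voxels transformed; zeros and NA voxels are now 0
```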
library(RNifti)
library(parallel)
library(RESI)
# Load masks once
gm_mask <- RNifti::readNifti("/media/disk2/ree/grey_matter_binary_mask.nii.gz")
wm_mask <- RNifti::readNifti("/media/disk2/ree/white_matter_binary_mask.nii.gz")
# Load dataset
resliced_meaningful_images <- readRDS("proportionalimagesubset_Section3.5.rds")
resliced_meaningful_images <- resliced_meaningful_images[!is.na(resliced_meaningful_images$number_of_subjects), ]
# Predefine output columns
num_cols <- c("es_min","es_1qt","es_median","es_3qt","es_max","es_mean",
"grey_mean","grey_min","grey_max",
"white_mean","white_min","white_max")
for (nm in num_cols) resliced_meaningful_images[[nm]] <- NA_real_
resliced_meaningful_images$effect_size_file <- NA_character_
resliced_meaningful_images$grey_file <- NA_character_
resliced_meaningful_images$white_file <- NA_character_
# Parallel cores
n_cores <- 24
# Main computation
results <- mclapply(seq_len(nrow(resliced_meaningful_images)), function(i) {
tryCatch({
file_path <- resliced_meaningful_images$eps_path[i]
if (is.na(file_path) || !nzchar(file_path) || !file.exists(file_path)) return(NULL)
n_subj <- resliced_meaningful_images$number_of_subjects[i]
map_type <- as.character(resliced_meaningful_images$map_type[i])
img <- RNifti::readNifti(file_path)
nonzero_mask <- (img != 0) & !is.na(img)
valid_vals <- img[nonzero_mask]
effect_img <- img
if (identical(map_type, "T map")) {
effect_img[nonzero_mask] <- RESI::t2S(valid_vals, rdf = n_subj - 1, n = n_subj)
} else if (identical(map_type, "Z map")) {
effect_img[nonzero_mask] <- RESI::z2S(valid_vals, n = n_subj, unbiased = TRUE)
} else {
return(NULL)
}
effect_img[!is.finite(effect_img)] <- 0
output_dir <- dirname(file_path)
if (!dir.exists(output_dir)) dir.create(output_dir, recursive = TRUE)
base_name <- sub("\\.nii\\.gz$", "", basename(file_path))
out_file <- file.path(output_dir, paste0(base_name, "_effect_size.nii.gz"))
RNifti::writeNifti(effect_img, out_file)
es_vals <- effect_img[effect_img != 0 & !is.na(effect_img)]
es_stats <- if (length(es_vals) > 0) {
c(
es_min = min(es_vals),
es_1qt = as.numeric(stats::quantile(es_vals, 0.25, type = 7, names = FALSE)),
es_median = median(es_vals),
es_3qt = as.numeric(stats::quantile(es_vals, 0.75, type = 7, names = FALSE)),
es_max = max(es_vals),
es_mean = mean(es_vals)
)
} else rep(NA_real_, 6)
grey_masked <- effect_img * gm_mask
white_masked <- effect_img * wm_mask
grey_vals <- grey_masked[grey_masked != 0]
white_vals <- white_masked[white_masked != 0]
grey_file <- sub("\\.nii\\.gz$", "_masked_grey_after_es.nii.gz", out_file)
white_file <- sub("\\.nii\\.gz$", "_masked_white_after_es.nii.gz", out_file)
RNifti::writeNifti(grey_masked, grey_file)
RNifti::writeNifti(white_masked, white_file)
list(
index = i,
effect_size_file = out_file,
grey_file = grey_file,
white_file = white_file,
es_min = es_stats[1],
es_1qt = es_stats[2],
es_median = es_stats[3],
es_3qt = es_stats[4],
es_max = es_stats[5],
es_mean = es_stats[6],
grey_mean = if (length(grey_vals) > 0) mean(grey_vals) else NA_real_,
grey_min = if (length(grey_vals) > 0) min(grey_vals) else NA_real_,
grey_max = if (length(grey_vals) > 0) max(grey_vals) else NA_real_,
white_mean = if (length(white_vals) > 0) mean(white_vals) else NA_real_,
white_min = if (length(white_vals) > 0) min(white_vals) else NA_real_,
white_max = if (length(white_vals) > 0) max(white_vals) else NA_real_
)
}, error = function(e) {
message(" Skipping index ", i, ": ", e$message)
NULL
})
}, mc.cores = n_cores)
# Filter and bind
results <- Filter(Negate(is.null), results)
df_results <- do.call(rbind, lapply(results, as.data.frame))
df_results$index <- as.integer(df_results$index)
# Match and update
cols_to_update <- intersect(names(df_results), names(resliced_meaningful_images))
for (nm in setdiff(cols_to_update, "index")) {
resliced_meaningful_images[df_results$index, nm] <- df_results[[nm]]
}
# Save final dataset
saveRDS(resliced_meaningful_images, "proportionalimagesubset_Section4.1.rds")
The mean value of the effect size is proportional to \(n_i^{-1} V_i^{-2}\), where \(n_i\) denotes the number of subjects (sample size) and \(V_i\) denotes the number of nonzero voxels in study \(i\).
After converting each study’s test-statistic image to an effect-size image, we summarize each image by its mean effect size within gray matter (GM) and white matter (WM). To screen for implausible or artifactual images, we model the expected variability of these mean effect sizes as a function of (i) study sample size and (ii) voxel coverage, and flag observations that fall outside the implied funnel.
Let \(y_{ij}\) denote the observed mean effect size for image \(j\) from collection \(c[i]\). We fit an intercept-only linear mixed-effects model with a collection-specific random intercept:
\[y_{ij} = \mu + u_{c[i]} + \varepsilon_{ij},\]
where \(\mu\) is the grand mean effect size and \(u_{c[i]} \sim \mathcal{N}(0, \tau^2)\) captures between-collection heterogeneity.
To account for heteroskedasticity across images, we model the residual variance using a power-of-covariates structure implemented via the \(\texttt{varPower}\) variance function in \(\texttt{nlme}\), and combine the two variance components using \(\texttt{varComb}\). Specifically, for study-level covariates \(N_{ij}\) (sample size) and \(V_{ij}\) (scaled voxel coverage), we assume
\[\mathrm{Var}(\varepsilon_{ij}) = \sigma^2 \, |N_{ij}|^{2\delta_1} \, |V_{ij}|^{2\delta_2},\]
so that residual variability decreases as \(N\) and/or \(V\) increase when \(\delta_1 < 0\) and/or \(\delta_2 < 0\).
The total per-observation variance (and thus the width of the funnel) incorporates both residual and between-collection variability:
\[SD_{ij}^2 = \tau^2 + \sigma^2 \, |N_{ij}|^{2\delta_1} \, |V_{ij}|^{2\delta_2}.\]
This quantity represents the model-implied standard deviation of observed mean effect sizes as a function of sample size and voxel coverage.
Outlier detection rule. We flag an observation as an outlier if its deviation from the fitted grand mean exceeds a multiple of its modeled standard deviation:
\[\bigl| y_{ij} - \hat{\mu} \bigr| > k \, SD_{ij},\]
with \(k = 3\). Equivalently, outliers are defined as observations lying outside the funnel boundaries \(\hat{\mu} \pm k \, SD_{ij}\).
This criterion standardizes residuals across studies while accounting for heteroskedasticity and within-collection correlation. Observations with standardized residuals exceeding \(\pm 3\) were flagged as inconsistent with the fitted noise model and removed. The same procedure was applied separately to gray matter and white matter mean effect sizes.
These analyses serve both to validate the effect-size transformation and to provide a principled, model-based mechanism for identifying implausible images prior to downstream inference. While alternative filtering strategies were explored, we selected the mixed-effects variance modeling approach as the most statistically grounded and interpretable method for screening artifactual data points. A summary table consolidating key model estimates and variance components is provided below.
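Under the fitted model, the screening rule reduces to a few lines. The parameter values below are arbitrary placeholders for illustration, not the fitted estimates:

```r
# placeholder parameters (the real values come from the fitted lme model)
mu <- 0; tau2 <- 0.005; sigma2 <- 0.4
delta_n <- -0.15; delta_v <- -1.4; k <- 3
N <- c(20, 200)    # sample sizes
V <- c(1.5, 1.5)   # scaled voxel coverage
y <- c(0.1, 2.0)   # observed mean effect sizes
# SD_ij^2 = tau^2 + sigma^2 * |N|^(2*delta_n) * |V|^(2*delta_v)
sd_ij <- sqrt(tau2 + sigma2 * abs(N)^(2 * delta_n) * abs(V)^(2 * delta_v))
outlier <- abs(y - mu) > k * sd_ij
outlier  # only the second image lies outside the funnel
```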
library(dplyr)
library(RNifti)
library(knitr)
resliced_images2 <- readRDS("proportionalimagesubset_Section4.1.rds")
GM_total <- sum(readNifti("/media/disk2/ree/grey_matter_binary_mask.nii.gz"))
WM_total <- sum(readNifti("/media/disk2/ree/white_matter_binary_mask.nii.gz"))
OUT_total <- 91 * 109 * 91 - GM_total - WM_total
q <- function(x) quantile(x, c(.25, .5, .75), na.rm = TRUE)
n_images <- nrow(resliced_images2)
n_collections <- dplyr::n_distinct(resliced_images2$collection_id)
N <- resliced_images2$number_of_subjects
gm <- resliced_images2$nvox_gm
wm <- resliced_images2$nvox_wm
out <- resliced_images2$nvox_outside
es_gm <- resliced_images2$grey_mean
es_wm <- resliced_images2$white_mean
summary_tbl <- data.frame(
Statistic = c(
"Number of images",
"Number of collections",
"Sample size, median [Q1, Q3]",
"Gray matter coverage, median % [Q1, Q3]",
"White matter coverage, median % [Q1, Q3]",
"Outside-mask coverage, median % [Q1, Q3]",
"Gray matter mean effect size, median [Q1, Q3]",
"White matter mean effect size, median [Q1, Q3]"
),
Value = c(
n_images,
n_collections,
sprintf("%.0f [%.0f, %.0f]", q(N)[2], q(N)[1], q(N)[3]),
sprintf("%.2f%% [%.2f, %.2f]",
100*q(gm/GM_total)[2], 100*q(gm/GM_total)[1], 100*q(gm/GM_total)[3]),
sprintf("%.2f%% [%.2f, %.2f]",
100*q(wm/WM_total)[2], 100*q(wm/WM_total)[1], 100*q(wm/WM_total)[3]),
sprintf("%.2f%% [%.2f, %.2f]",
100*q(out/OUT_total)[2], 100*q(out/OUT_total)[1], 100*q(out/OUT_total)[3]),
sprintf("%.3f [%.3f, %.3f]", q(es_gm)[2], q(es_gm)[1], q(es_gm)[3]),
sprintf("%.3f [%.3f, %.3f]", q(es_wm)[2], q(es_wm)[1], q(es_wm)[3])
),
stringsAsFactors = FALSE
)
knitr::kable(summary_tbl, caption = "Summary statistics for effect-size screening")

| Statistic | Value |
|---|---|
| Number of images | 12188 |
| Number of collections | 3255 |
| Sample size, median [Q1, Q3] | 30 [18, 53] |
| Gray matter coverage, median % [Q1, Q3] | 97.96% [91.22, 99.20] |
| White matter coverage, median % [Q1, Q3] | 98.85% [94.56, 99.54] |
| Outside-mask coverage, median % [Q1, Q3] | 3.64% [1.18, 8.42] |
| Gray matter mean effect size, median [Q1, Q3] | 0.004 [-0.037, 0.055] |
| White matter mean effect size, median [Q1, Q3] | 0.001 [-0.037, 0.039] |
# This chunk is retained for reference but superseded below: predict() does not
# return per-observation variances, so the band drawn here is not model-correct
library(nlme)
library(ggplot2)
library(patchwork)
library(knitr)
# Load data
dat <- readRDS("proportionalimagesubset_Section4.1.rds")
dat <- dat[!is.na(dat$es_mean), ]
dat$voxel_scaled <- dat$non_zero_voxels_new/100000
m_power <- lme(
es_mean ~ 1,
random = ~ 1 | collection_id,
weights = varComb(
varPower(form = ~ number_of_subjects),
varPower(form = ~ voxel_scaled)),
data = dat,
method = "REML"
)
# Extract overall effect
overall_effect <- fixef(m_power)
# Get per-observation residual variance from model
var_fun <- predict(m_power, type = "var") # residual variance per obs
resid_sd <- sqrt(var_fun)
# Total SE = residual SD + tau² from random effects
tau2 <- as.numeric(VarCorr(m_power)[1, 1])
dat$se_model <- sqrt(var_fun + tau2)
# Standardized residuals (for panel A)
dat$res_std <- residuals(m_power, type = "pearson")
# Outlier definition (funnel-based only, cutoff ±k·SE)
k <- 3
dat$outlier <- abs(dat$es_mean - overall_effect) > k * dat$se_model
dat$outlier[is.na(dat$outlier)] <- FALSE # fix NA -> FALSE
# Sorted versions for smooth funnel lines
dat_sorted <- dat[order(dat$number_of_subjects), ]
dat_clean <- subset(dat_sorted, !outlier)
dat_sorted <- subset(
dat_sorted,
is.finite(number_of_subjects) & number_of_subjects > 0 &
is.finite(es_mean) &
is.finite(se_model)
)
dat_clean <- subset(
dat_clean,
is.finite(number_of_subjects) & number_of_subjects > 0 &
is.finite(es_mean) &
is.finite(se_model)
)
dat_plot <- subset(
dat,
is.finite(number_of_subjects) & number_of_subjects > 0 &
is.finite(es_mean) &
is.finite(se_model)
)
# Funnel before exclusion
n_out <- sum(dat$outlier)
n_tot <- nrow(dat)
# illustrative "two normals" band (hourglass/violin-like)
center <- 0
# fixed log-x domain [0, 4]
band_grid <- data.frame(
log_n = seq(0, 4, length.out = 800)
)
# controls for the shape
mu <- 2.0 # where the band is widest (log10(n)=2 -> n=100)
sig <- 0.55 # spread of the bump (bigger = wider hump)
w_min <- 0.25 # thickness at the ends (log_n near 0 or 4)
w_max <- 4.0 # max half-width at the center
# Gaussian width in log space
band_grid$w <- w_min + (w_max - w_min) * exp(-0.5 * ((band_grid$log_n - mu) / sig)^2)
band_grid$lower <- center - band_grid$w
band_grid$upper <- center + band_grid$w
ggplot(dat_plot, aes(x = log10(number_of_subjects), y = es_mean, color = outlier)) +
geom_point(alpha = 0.6, na.rm = TRUE) +
# purple jagged lines
geom_line(
aes(y = as.numeric(overall_effect) + as.numeric(k) * se_model, group = 1),
color = "purple",
na.rm = TRUE
) +
geom_line(
aes(y = as.numeric(overall_effect) - as.numeric(k) * se_model, group = 1),
color = "purple",
na.rm = TRUE
) +
scale_color_manual(values = c("black", "red"),
labels = c("Inlier", "Outlier")) +
# fixed x-axis 0..4 with labels in n
scale_x_continuous(
limits = c(0, 4),
breaks = 0:4,
labels = c("1", "10", "100", "1000", "10000")
) +
coord_cartesian(ylim = c(-10, 10)) +
labs(
title = "Figure 7: Effect Size vs Sample Size",
x = "Sample size (log scale)",
y = "Effect Size Mean"
) +
theme_minimal()
library(nlme)
library(ggplot2)
library(patchwork)
library(knitr)
library(sjPlot)
dat <- readRDS("proportionalimagesubset_Section4.1.rds")
dat <- dat[!is.na(dat$es_mean), ]
dat$voxel_scaled <- dat$non_zero_voxels_new / 100000
m_power <- lme(
es_mean ~ 1,
random = ~ 1 | collection_id,
weights = varComb(
varPower(form = ~ number_of_subjects),
varPower(form = ~ voxel_scaled)
),
data = dat,
method = "REML"
)
overall_effect <- as.numeric(fixef(m_power))
sigma2 <- sigma(m_power)^2
tau2 <- as.numeric(VarCorr(m_power)[1, "Variance"])
# varPower exponents (delta_n, delta_v)
expParams <- intervals(m_power)$varStruct[, 2]
delta_n <- expParams[1]
delta_v <- expParams[2]
dat$se_model <- sqrt(
tau2 + sigma2 *
(dat$number_of_subjects ^ (2*delta_n)) *
(dat$voxel_scaled ^ (2*delta_v))
)
# standardized residuals (optional panel A)
dat$res_std <- residuals(m_power, type = "pearson")
k <- 3
dat$outlier <- abs(dat$es_mean - overall_effect) > k * dat$se_model
dat$outlier[is.na(dat$outlier)] <- FALSE
dat_plot <- subset(
dat,
is.finite(number_of_subjects) & number_of_subjects > 0 &
is.finite(es_mean) & is.finite(se_model)
)
dat_sorted <- dat_plot[order(dat_plot$number_of_subjects), ]
dat_clean <- subset(dat_sorted, !outlier)
n_out <- sum(dat_plot$outlier)
n_tot <- nrow(dat_plot)
grid <- data.frame(
number_of_subjects = seq(1, 10000, length.out = 800)
)
# choose a constant voxel_scaled for the band (median is a sensible default)
grid$voxel_scaled <- median(dat_plot$voxel_scaled, na.rm = TRUE)
grid$se_model <- sqrt(
tau2 + sigma2 *
(grid$number_of_subjects ^ (2*delta_n)) *
(grid$voxel_scaled ^ (2*delta_v))
)
grid$upper <- overall_effect + k * grid$se_model
grid$lower <- overall_effect - k * grid$se_model
grid$log_n <- log10(grid$number_of_subjects)
p_log <- ggplot(dat_plot, aes(x = log10(number_of_subjects), y = es_mean, color = outlier)) +
geom_point(alpha = 0.6, na.rm = TRUE) +
geom_ribbon(
data = grid,
aes(x = log_n, ymin = lower, ymax = upper),
inherit.aes = FALSE,
fill = "green",
alpha = 0.18
) +
coord_cartesian(ylim = c(-5, 10)) +
geom_line(data = grid, aes(x = log_n, y = lower),
inherit.aes = FALSE, color = "green", linewidth = 0.8) +
geom_line(data = grid, aes(x = log_n, y = upper),
inherit.aes = FALSE, color = "green", linewidth = 0.8) +
scale_color_manual(values = c("black", "red"),
labels = c("Inlier", "Outlier")) +
scale_x_continuous(
limits = c(0, 4),
breaks = 0:4,
labels = c("1", "10", "100", "1000", "10000")
) +
labs(
title = paste0(
"Figure 8: Effect Size vs Sample Size (model-implied ±", k,
"·SD band; voxel_scaled=", round(grid$voxel_scaled[1], 3), ")"
),
x = "Sample size (log scale)",
y = "Effect Size Mean"
) +
theme_minimal()
p_log
library(plotly)
# Use the same parameters you already computed:
# overall_effect, sigma2, tau2, delta_n, delta_v, k
# Build a grid over n and voxel_scaled
n_seq <- 10^seq(0, 4, length.out = 60) # 1 ... 10000 on log scale
v_seq <- seq(
quantile(dat_plot$voxel_scaled, 0.05, na.rm = TRUE),
quantile(dat_plot$voxel_scaled, 0.95, na.rm = TRUE),
length.out = 60
)
grid3d <- expand.grid(number_of_subjects = n_seq, voxel_scaled = v_seq)
grid3d$se_model <- sqrt(
tau2 + sigma2 *
(grid3d$number_of_subjects ^ (2 * delta_n)) *
(grid3d$voxel_scaled ^ (2 * delta_v))
)
grid3d$upper <- overall_effect + k * grid3d$se_model
grid3d$lower <- overall_effect - k * grid3d$se_model
# Matrices for plotly surfaces (plotly expects z with rows indexed by y, cols by x;
# expand.grid varies number_of_subjects fastest, so transpose after filling)
xg <- sort(unique(log10(grid3d$number_of_subjects)))
yg <- sort(unique(grid3d$voxel_scaled))
UP <- t(matrix(grid3d$upper, nrow = length(xg), ncol = length(yg)))
LO <- t(matrix(grid3d$lower, nrow = length(xg), ncol = length(yg)))
MU <- matrix(overall_effect, nrow = length(yg), ncol = length(xg)) # flat mean plane
# 3D plot: points + mean plane + upper/lower "funnel" surfaces
plot_ly() |>
add_markers(
data = dat_plot,
x = ~log10(number_of_subjects),
y = ~voxel_scaled,
z = ~es_mean,
marker = list(size = 2, opacity = 0.25),
name = "Images"
) |>
add_surface(
x = xg, y = yg, z = MU,
opacity = 0.35,
name = "Overall effect (mu)"
) |>
add_surface(
x = xg, y = yg, z = UP,
opacity = 0.20,
showscale = FALSE,
name = "Upper (mu + k*SD)"
) |>
add_surface(
x = xg, y = yg, z = LO,
opacity = 0.20,
showscale = FALSE,
name = "Lower (mu - k*SD)"
) |>
layout(
scene = list(
xaxis = list(title = "log10(sample size)"),
yaxis = list(title = "voxel_scaled (nonzero voxels / 1e5)"),
zaxis = list(title = "Effect size mean")
)
)
extract_var_params <- function(fit) {
sigma2 <- sigma(fit)^2
tau2 <- as.numeric(VarCorr(fit)[1, "Variance"])
expParams <- intervals(fit)$varStruct[, 2]
delta_n <- expParams[1]
delta_v <- expParams[2]
data.frame(
sigma2 = sigma2,
tau2 = tau2,
delta_n = delta_n,
delta_v = delta_v
)
}
m_full <- lme(
es_mean ~ 1,
random = ~ 1 | collection_id,
weights = varComb(
varPower(form = ~ number_of_subjects),
varPower(form = ~ voxel_scaled)
),
data = dat,
method = "REML"
)
dat_clean <- subset(dat, !outlier)
m_clean <- lme(
es_mean ~ 1,
random = ~ 1 | collection_id,
weights = varComb(
varPower(form = ~ number_of_subjects),
varPower(form = ~ voxel_scaled)
),
data = dat_clean,
method = "REML"
)
tab_var <- rbind(
cbind(Dataset = "Before outlier exclusion",
extract_var_params(m_full)),
cbind(Dataset = "After outlier exclusion",
extract_var_params(m_clean))
)
# nicer formatting
tab_var$sigma2 <- round(tab_var$sigma2, 4)
tab_var$tau2 <- round(tab_var$tau2, 4)
tab_var$delta_n <- round(tab_var$delta_n, 3)
tab_var$delta_v <- round(tab_var$delta_v, 3)
kable(
tab_var,
col.names = c("Dataset", "$\\sigma^2$", "$\\tau^2$", "$\\delta_n$", "$\\delta_v$"),
escape = FALSE,
caption = "Estimated variance components and heteroskedasticity exponents before and after outlier exclusion."
)

| Dataset | $\sigma^2$ | $\tau^2$ | $\delta_n$ | $\delta_v$ |
|---|---|---|---|---|
| Before outlier exclusion | 89636.1198 | 0.0282 | -0.185 | -6.836 |
| After outlier exclusion | 0.3866 | 0.0049 | -0.147 | -1.431 |
library(kableExtra) # provides kable_styling(); masks dplyr::group_rows
summary_tbl <- tibble(
Metric = c(
"Total maps",
"Unique collections",
"Median sample size",
"Sample size SD",
"Median voxel coverage",
"Effect size median",
"Effect size IQR",
"Model SE median",
"Outliers (%)"
),
Value = c(
nrow(dat),
n_distinct(dat$collection_id),
median(dat$number_of_subjects, na.rm = TRUE),
sd(dat$number_of_subjects, na.rm = TRUE),
median(dat$non_zero_voxels_section3.3, na.rm = TRUE),
median(dat$es_mean, na.rm = TRUE),
IQR(dat$es_mean, na.rm = TRUE),
median(dat$se_model, na.rm = TRUE),
100 * mean(dat$outlier, na.rm = TRUE)
)
)
summary_tbl %>%
mutate(Value = round(Value, 3)) %>%
kable(
col.names = c("Preprocessing summary", "Value"),
align = c("l", "r")
) %>%
kable_styling(
bootstrap_options = c("striped", "hover"),
full_width = FALSE
)

| Preprocessing summary | Value |
|---|---|
| Total maps | 12188.000 |
| Unique collections | 3255.000 |
| Median sample size | 30.000 |
| Sample size SD | 163.537 |
| Median voxel coverage | 262368.000 |
| Effect size median | 0.003 |
| Effect size IQR | 0.078 |
| Model SE median | 0.275 |
| Outliers (%) | 0.287 |